arXiv: 1507.02119v2 [q-bio.PE] 11 Nov 2015 


Locating a Tree in a Reticulation-Visible Network in Cubic Time


Andreas D.M. Gunawan*   Bhaskar DasGupta†   Louxin



Abstract 


In phylogenetics, phylogenetic trees are rooted binary trees, whereas phylogenetic 
networks are rooted acyclic digraphs. Edges are directed away from the root and 
leaves are uniquely labeled with taxa in phylogenetic networks. For the purpose of 
evolutionary model validation, biologists check whether or not a phylogenetic tree is 
contained in a phylogenetic network on the same taxa. Such a tree containment problem 
is NP-complete. A phylogenetic network is reticulation-visible if every reticulation node 
separates the network root from some leaves. We answer an open problem by proving 
that the problem is solvable in cubic time for reticulation-visible phylogenetic networks. 

The key gadget used in our answer also allows us to design a linear-time algorithm
for the cluster containment problem for networks of this type and to prove that every
galled network with $n$ leaves has at most $2(n-1)$ reticulation nodes.

1 Introduction 

How life came into existence and evolved has been a key scientific question for hundreds
of years. Traditionally, a (phylogenetic) tree has been used to model the evolutionary history
of species, in which a node represents a speciation event and the leaves represent the extant
species under study. Such evolutionary trees are often reconstructed from the gene or protein
sequences sampled from the extant species under study. Since genomic studies have
demonstrated that genetic material is often transferred between organisms in a non-reproductive
manner [?], it has been commonly accepted that (phylogenetic) networks are more
suitable for modeling horizontal gene transfer, introgression, recombination and hybridization
events in genome evolution [?]. Mathematically, a network is a rooted acyclic
digraph with labeled leaves. Algorithmic and combinatorial aspects of networks have been
intensively studied in the past two decades [?].

An important issue is to check the "consistency" of two evolutionary models. A somewhat
simpler (but nonetheless very important) version of this issue asks whether a given network
is consistent with an existing tree model or not. This motivates researchers to study the
problem of determining whether a tree is displayed by a network or not, called the tree
containment problem (TCP). The cluster containment problem (CCP) is another related
algorithmic problem that asks whether or not a subset of taxa is a cluster in a tree displayed
by a network. Both the TCP and the CCP have also been investigated in the development of
network metrics [?].

The TCP and CCP are NP-complete [13], even on a very restricted class of networks [20].
van Iersel, Semple and Steel posed the open problem of whether or not the TCP is solvable in
polynomial time for reticulation-visible networks [?]. The visibility property was
originally introduced to capture an important feature of galled networks [?]. A network is
reticulation-visible if every reticulation node separates the network root from some leaves.
Real network models are likely reticulation-visible (see [?] for example). Although great
effort has been devoted to the study of the TCP, it has been shown to be polynomial-time
solvable only for a couple of very restricted subclasses of reticulation-visible networks [?].
Other studies related to the TCP include [?].

In this paper, we make three contributions. We give an affirmative answer to the open
problem by presenting a cubic-time algorithm for the TCP for binary reticulation-visible
networks. Additionally, we present a linear-time algorithm for the CCP for binary
reticulation-visible networks. These two algorithms are further modified into polynomial-time
algorithms for non-binary reticulation-visible networks. Our algorithms rely on an important
decomposition theorem, which is proved in Section 4. Empowered by it, we also prove that any
galled network with $n$ leaves has at most $2(n-1)$ reticulation nodes.

The rest of the paper is organized as follows. Section 2 introduces basic concepts and
notation. Section 3 lists our main results (Theorems 3.1 and 3.2) and gives a brief summary
of the algorithmic methodologies that lead us to the results. In Section 4 we present
a decomposition theorem (Theorem 4.1) that reveals an important structural property of
reticulation-visible networks, based on which the two main theorems are respectively proved
in Section 5 and Appendix 3 (due to page limitation).


2 Basic Concepts and Notation 

2.1 Phylogenetic networks 

In phylogenetics, networks are acyclic digraphs in which a unique node (the root) has a 
directed path to every other node and the nodes of indegree one and outdegree zero (called 
the leaves) are uniquely labeled. Leaves represent bio-molecular sequences, extant organisms 
or species under study. In this paper, we also assume that each non-root node in a network 
is of either indegree one or outdegree one. A node is called a reticulation node if its indegree 
is strictly greater than one and its outdegree is precisely one. Reticulation nodes represent 
reticulation events occurring in evolution. Non-reticulation nodes are called tree nodes, which 
include the root and leaves. 

For convenience in describing the algorithms and proofs, we add an open incoming edge
to the root so that its degree is also 3 (Figure 1). A network is binary if leaves have degree
1 and all other nodes have degree 3 in the network.

Let $N$ be a network. We use the following notation:

• $\rho(N)$ is the root of $N$.

• $\mathcal{L}(N)$ is the set of all leaves in $N$.

• $\mathcal{R}(N)$ is the set of all reticulation nodes in $N$.

• $\mathcal{T}(N)$ is the set of all tree nodes of degree 3 in $N$.

• $\mathcal{V}(N) = \mathcal{R}(N) \cup \mathcal{T}(N)$, which is the set of all nodes in $N$.

• $\mathcal{E}(N)$ is the set of all edges in $N$.

• For two nodes $u, v$ in $N$:

  - $u$ is a parent of $v$, or alternatively $v$ is a child of $u$, if $(u, v)$ is a directed edge in
    $N$, and

  - $u$ is an ancestor of $v$, or alternatively $v$ is a descendant of $u$, if there is a directed
    path from $u$ to $v$. In this case, we also say $v$ is below $u$.

• $p(u)$ is the set of the parents of $u \in \mathcal{R}(N)$ or the unique parent of
  $u \in \mathcal{T}(N) \setminus \{\rho(N)\}$.

• $c(u)$ is the set of the children for $u \in \mathcal{T}(N)$ or the unique child for $u \in \mathcal{R}(N)$.

• $\mathcal{D}_N(u)$ is the subnetwork vertex-induced by $u \in \mathcal{V}(N)$ and all descendants of $u$.

• For any $E \subseteq \mathcal{E}(N)$, $N - E$ is the subnetwork of $N$ with the (same) node set $\mathcal{V}(N)$ and
  the edge set $\mathcal{E}(N) \setminus E$.

• For any subset $V$ of nodes of $N$, $N - V$ is the subnetwork of $N$ with the node set
  $\mathcal{V}(N) \setminus V$ and the edge set $\{(x, y) \in \mathcal{E}(N) \mid x \notin V, y \notin V\}$.

2.2 The visibility property 

Let $N$ be a network and $u, v \in \mathcal{V}(N)$. We say that $u$ is visible (or stable) on $v$ if every path
from the root $\rho(N)$ to $v$ must contain $u$ [?] (also see [12, p. 165]). In computer science, $u$ is
called a dominator of $v$ if $u$ is visible on $v$ [?].

A reticulation node is visible if it is visible on some leaf. A network is reticulation-visible 
if every reticulation node is visible. In other words, each reticulation node separates the root 
from some leaves in a reticulation-visible network. 
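
The definition can be checked directly, although not efficiently: a node $u$ is visible on a leaf exactly when deleting $u$ cuts that leaf off from the root. The following Python sketch illustrates this test. It assumes, purely for illustration, that a network is stored as a dictionary `children` mapping every node (leaves included) to the list of its children; this representation and the function names are not taken from the paper.

```python
from collections import deque

def reachable(children, root, removed=None):
    """Nodes reachable from `root` when the node `removed` is deleted."""
    if root == removed:
        return set()
    seen = {root}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for y in children[x]:
            if y != removed and y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

def is_reticulation_visible(children, root):
    """True iff every node of indegree >= 2 is visible on some leaf."""
    indeg = {x: 0 for x in children}
    for ys in children.values():
        for y in ys:
            indeg[y] += 1
    leaves = {x for x, ys in children.items() if not ys}
    for r, d in indeg.items():
        if d < 2:
            continue  # r is a tree node
        # Leaves that become unreachable once r is deleted are exactly
        # the leaves on which r is visible.
        if not (leaves - reachable(children, root, removed=r)):
            return False
    return True
```

One breadth-first search is run per reticulation node, so this check is quadratic in the worst case; dominator-tree algorithms would be the usual choice when speed matters.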

The phylogenetic network in Figure 1 is reticulation-visible. Clearly, all trees are
reticulation-visible, as they do not contain any reticulation nodes. In fact, the widely studied
tree-child networks and galled networks are also reticulation-visible [?].

2.3 The TCP and CCP 

Suppression of a node of indegree and outdegree one means that the node is removed and 
the two edges incident to it are merged into an edge with the same orientation between the 
two neighbors of it. A tree T' is called a subdivision of another tree T if T can be obtained 
from T' by the suppression of some nodes of indegree and outdegree one in T'. 

Consider a binary network $N$ in which each reticulation node has two incoming edges and one
outgoing edge. Thus, the removal of one incoming edge for each reticulation node results
in a directed tree. However, there may exist new (dummy) leaves in the obtained tree. For
example, after removing $e_1$, $e_2$, $(x, v)$, and $(x, u)$ in the network given in Figure 1A, we obtain
the tree shown in Figure 1B, in which $x$ is a new leaf besides the original leaves $\ell_i$ ($1 \le i \le 4$).
If the obtained tree contains such dummy leaves, we will have to remove them and some of
their ancestors to obtain a subtree having the same set of leaves as $N$.

Figure 1: The binary network in panel A displays the tree in panel D through the removal
of four edges $e_1$, $e_2$, $(x, v)$, $(x, u)$ (B) and the node $x$ (C). Here, the reticulation nodes are
represented by shaded circles.

Definition 2.1 (Tree Containment) Let $N$ be a network. For each $u \in \mathcal{V}(N)$, $d_{in}(u)$ denotes
the indegree of $u$. $N$ displays (or contains) a phylogenetic tree $T$ over the same taxa (that
is, $\mathcal{L}(N) = \mathcal{L}(T)$) if $E \subseteq \mathcal{E}(N)$ and $V \subseteq \mathcal{V}(N)$ exist such that (i) $E$ contains exactly $d_{in}(u) - 1$
incoming edges for each $u \in \mathcal{R}(N)$, and (ii) $N - E - V$ is a subdivision of $T$.

Because of the existence of dummy leaves, $V$ may be nonempty to guarantee that
$N - E - V$ has the same set of leaves as $T$. Note that a binary network with $k$ reticulation nodes
can display as many as $2^k$ trees. The TCP is to determine whether a network displays a
phylogenetic tree or not.

The set of all the labeled leaves in a subtree rooted at a node is called the cluster of the
node in a phylogenetic tree. An internal node in a network may have different clusters in
different trees displayed in the network. Given a subset of labeled leaves $B \subseteq \mathcal{L}(N)$, $B$ is a
soft cluster in $N$ if $B$ is the cluster of a node in some tree displayed in $N$. The CCP is to
determine whether a subset $B$ of $\mathcal{L}(N)$ is a soft cluster in a network $N$ or not.
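
To make the two problems concrete, the definitions can be turned into an exponential-time reference procedure: keep one incoming edge per reticulation node, discard dummy leaves, and compare clusters. The Python sketch below is only an illustration of the definitions, not the algorithm of this paper. It assumes a network is given as a dictionary `children` mapping every node to its list of children, with leaf names serving as taxa, and it relies on the standard fact that a rooted phylogenetic tree is determined by its set of clusters (suppressing nodes of indegree and outdegree one does not change that set). All function names are hypothetical.

```python
from itertools import product

def displayed_cluster_sets(children, root):
    """For each choice of one kept parent per reticulation node, yield the
    set of non-empty clusters (frozensets of leaves) of the resulting tree."""
    parents = {}
    for x, ys in children.items():
        for y in ys:
            parents.setdefault(y, []).append(x)
    retics = [x for x, ps in parents.items() if len(ps) > 1]
    leaves = {x for x, ys in children.items() if not ys}
    for kept in product(*(parents[r] for r in retics)):
        keep = dict(zip(retics, kept))
        clusters = set()

        def cluster(x):
            if x in leaves:
                c = frozenset([x])
            else:
                # Follow edge (x, y) only if it was not removed.
                c = frozenset().union(*(cluster(y) for y in children[x]
                                        if keep.get(y, x) == x))
            if c:  # dummy leaves and their dead-end ancestors are dropped
                clusters.add(c)
            return c

        cluster(root)
        yield clusters

def displays_tree(net, net_root, tree, tree_root):
    """Brute-force TCP: does the network display the tree?"""
    target = next(displayed_cluster_sets(tree, tree_root))
    return any(cs == target for cs in displayed_cluster_sets(net, net_root))

def is_soft_cluster(net, net_root, B):
    """Brute-force CCP: is B a cluster in some displayed tree?"""
    B = frozenset(B)
    return any(B in cs for cs in displayed_cluster_sets(net, net_root))
```

With $k$ reticulation nodes of indegree two the generator visits $2^k$ edge choices, which is why such a direct approach is only usable on very small inputs.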

3 Our Results 

Theorem 3.1 (Main Result) Given a binary reticulation-visible network $N$ and a binary
tree $T$, the TCP for $N$ and $T$ can be solved in $O(|\mathcal{L}(N)|^3)$ time.

The CCP is another important problem that has a quadratic-time algorithm for
reticulation-visible networks [12, pp. 168-171]. Here, we design an optimal algorithm for it. The details
of this algorithm are omitted due to page limitation and can be found in Appendix 3.

Theorem 3.2 Given a binary reticulation-visible network $N$ and an arbitrary subset $B$ of
labeled leaves in $N$, the CCP for $N$ and $B$ can be solved in $O(|\mathcal{L}(N)|)$ time.

Synopsis of Algorithmic Methodologies The reader may wonder why solving the TCP
is hard, as after all $N$ is an acyclic digraph and $T$ is a binary tree. First, it is closely related
to the subgraph isomorphism problem (SIP). In general, it is very tricky to find out whether
a special case of the SIP remains NP-complete or can be solved in polynomial time. For
example, deciding whether or not a directed tree $H$ is a subgraph of an acyclic digraph $G$ is
NP-complete, but can be done in polynomial time provided that $G$ is a forest [?]. Second,
there are non-planar reticulation-visible networks. Hence, algorithmic techniques for planar
graphs might not be suitable here. Third, the TCP remains NP-complete even for binary
networks in which each reticulation node has a tree node as its sibling and another tree node
as its child [20].

The TCP has been known to be solvable in polynomial time only for tree-child networks [20]
and the so-called nearly-stable networks [?]. In a tree-child network, each reticulation node
is essentially connected to a leaf by a path consisting of only tree nodes. A simulation study
indicates that tree-child networks comprise a very restricted subclass of reticulation-visible
networks (Gambette, http://phylnet.univ-mlv.fr/recophync/). In a nearly-stable network,
each child of a node is visible if the node itself is not visible. Because of the simple local
structure around a reticulation node in such a network, one can determine whether or not it
displays a phylogenetic tree by examining reticulation nodes one by one. However, any approach
that works on reticulation nodes one by one is not good enough for solving the TCP for a
reticulation-visible network having the structure shown in Figure [?] if it has a large number
of reticulation nodes above the two reticulation nodes at the bottom. For reticulation-visible
networks of this kind, we may need to deal with the whole set of reticulation nodes
simultaneously.

Our algorithms for the TCP and CCP rely primarily on a powerful decomposition
theorem (Theorem 4.1). Roughly speaking, the theorem states that, in a reticulation-visible
network, all non-reticulation nodes can be partitioned into a collection of disjoint connected
components such that each component has at least two nodes if it does not consist of a single
leaf. Most importantly, each component contains either a network leaf or all the parents of
a reticulation node.

The topological property uncovered by this theorem allows us to solve the TCP and
CCP by a divide-and-conquer approach: we work on the tree components one by one in
a bottom-up fashion. In the TCP case, when working on a tree component, we simply call
a dynamic programming algorithm to decipher all the reticulation nodes right below it. In
the CCP case, a structurally slightly more complex (but faster) dynamic programming
algorithm is used.


4 A Decomposition Theorem 

In this section, we shall present a decomposition theorem that plays a vital role in designing 
a fast algorithm for the TCP and CCP. We first show that reticulation-visible networks have 
two useful properties. 

Proposition 4.1 A reticulation-visible network $N$ has the following two properties:

(a) (Reticulation separability) The child and the parents of a reticulation node are tree
nodes.

(b) (Visibility inheritability) Let $E \subseteq \mathcal{E}(N)$. If $N - E$ is connected and $\mathcal{L}(N - E) = \mathcal{L}(N)$,
then $N - E$ is also reticulation-visible.


Proof (a) Suppose on the contrary there are $u, v \in \mathcal{R}(N)$ such that $v$ is the child of $u$. Let
$w$ be another parent of $v$. Since $N$ is acyclic, $w$ is not below $v$ and hence not below $u$. Since
$w$ is not a descendant of $u$, there is a path $P(\rho(N), w)$ from $\rho(N)$ to $w$ that does not contain
$u$.

We now prove that $u$ is not visible on any leaf by contradiction. Assume $u$ is visible on
a leaf $\ell$. There is a path $P'$ from $\rho(N)$ to $\ell$ containing $u$. Since $v$ is the only child of $u$, $v$
appears after $u$ in $P'$. Define $P'[v, \ell]$ to be the subpath of $P'$ from $v$ to $\ell$. The concatenation
of $P(\rho(N), w)$, $(w, v)$, and $P'[v, \ell]$ gives a path from $\rho(N)$ to $\ell$. However, this path does not
contain $u$, a contradiction.

(b) The proposition follows from the observation that the removal of an edge only
eliminates some directed paths and does not add any new path from the root to a leaf. □


Consider a reticulation-visible network $N$. By Proposition 4.1, each reticulation node
is incident to only tree nodes. Furthermore, each connected component $C$ of $N - \mathcal{R}(N)$
(ignoring edge direction) is actually a subtree of $N$ in which edges are directed away from its
root. Indeed, if $C$ contains two nodes $u$ and $v$ both of indegree 0, where indegree is defined
over $N - \mathcal{R}(N)$, the path between $u$ and $v$ (ignoring edge direction) must contain a node $x$
with indegree 2, contradicting that $x$ is a tree node in $N$. Hence, the connected components
of $N - \mathcal{R}(N)$ are called the tree-node components of $N$.

Let $C$ be a tree-node component of $N$ and $\mathcal{V}(C)$ denote its vertex set. It is called a
single-leaf component if $\mathcal{V}(C) = \{\ell\}$ for some $\ell \in \mathcal{L}(N)$. It is a big tree-node component
if $|\mathcal{V}(C)| \ge 2$. The binary reticulation-visible network in Figure 2A has four big tree-node
components and five single-leaf components.

By definition, any two tree-node components $C'$ and $C''$ of $N$ are disjoint. We say $C'$ is
below $C''$ if there is a reticulation node $r$ such that $C''$ contains a parent of $r$ and the child
of $r$ is the root of $C'$.


Theorem 4.1 (Decomposition Theorem) Let $N$ be a reticulation-visible network with $m$
tree-node components $C_1, C_2, \ldots, C_m$. The following statements are true:

(i) $\mathcal{T}(N) = \biguplus_{i=1}^{m} \mathcal{V}(C_i)$.

(ii) For each reticulation node $r$, its child $c(r)$ is the root of some $C_i$, and each of its parents
is a node in a different $C_j$.

(iii) For each tree-node component $C_k$,

(a) $|\mathcal{V}(C_k)| = 1$ iff $\mathcal{V}(C_k) = \{\ell\}$ for some $\ell \in \mathcal{L}(N)$ (i.e. it is a single-leaf component),
and

(b) If $C_k$ is big, either $C_k$ contains a network leaf or a reticulation node $r$ exists such
that its parents are all in $C_k$.

(iv) A big tree-node component $C$ exists such that there are only single-leaf components
below it.

Figure 2: (A) A binary reticulation-visible network having 9 tree-node components, of which
four ($C_1$-$C_4$) are the big ones. (B) A tree for consideration of its containment in the network.
When $C = C_4$ is first selected, we focus on the path from the root $v_1$ to $\ell_4$ in the given tree,
where $L_C = \{\ell_4\}$, $s_C = 4$ and $d_C = 3$.


Proof (i) The set equality follows from the fact that the tree-node components are all the
connected components of $N - \mathcal{R}(N)$.

(ii) Let $r \in \mathcal{R}(N)$. By the Reticulation Separability property in Proposition 4.1, $c(r)$ and
the parents of $r$ are all tree nodes. Thus, by (i), each of them is in a tree-node component.
Furthermore, since $c(r)$ is of indegree 0 in $N - \mathcal{R}(N)$, $c(r)$ must be the root of the tree-node
component where it is found.

(iii) (a) Let $C$ be a tree-node component such that $|\mathcal{V}(C)| = 1$. Assume $\mathcal{V}(C) = \{u\} \subseteq
\mathcal{T}(N) \setminus \mathcal{L}(N)$. Since $u$ is the only non-leaf tree node in $C$, $p(u) \in \mathcal{R}(N)$ and $c(u) \subseteq \mathcal{R}(N)$.
Any leaf descendant of $p(u)$ must be below some child of $u$. Let $c(u) = \{c_1, c_2, \ldots, c_k\}$ be the
children of $u$. Since $k$ is finite and $N$ is acyclic, there is a subset $S$ of $i$ children $c_{k_1}, \ldots, c_{k_i}$
such that (i) $c_{k_j}$ is not below any node in $c(u)$ for each $j$, and (ii) each child in $c(u)$ is either
in $S$ or below some child in $S$. For each $j \le i$, using the same argument as in the proof of
part (a) of Proposition 4.1, we can prove that for each leaf $\ell$ below $c_{k_j}$, there is a path
from $\rho(N)$ to $\ell$ not containing $p(u)$. Since any leaf below $p(u)$ must be below some child
in $S$, $p(u)$ is not visible. This contradicts the fact that $N$ is reticulation-visible. Therefore,
$|\mathcal{V}(C)| = 1$ if and only if it is a single-leaf component.

(b) Assume that $C$ is a big tree-node component of $N$, that is, $|\mathcal{V}(C)| \ge 2$. Let $\rho(C)$ be
the root of $C$. $\rho(C)$ and its reticulation parent are visible on some network leaf, say $\ell$. If $\ell$
is in $C$, we are done.

If $\ell$ is not in $C$, we define $X = \{r \in \mathcal{R}(N) \mid p(r) \cap \mathcal{V}(C) \ne \emptyset \text{ and } \ell \text{ is below } r\}$. For any
$r', r'' \in X$, we write $r' \prec_X r''$ if $r'$ is below $r''$, that is, there is a directed path from $r''$ to $r'$.
Since $\prec_X$ is transitive, $X$ is finite and $N$ is acyclic, $X$ contains a maximal element $r_m$ with
respect to $\prec_X$. Let $p(r_m) = \{p_1, p_2, \ldots, p_k\}$. Since $r_m \in X$, we may assume that $p_1 \in \mathcal{V}(C)$.

If $p_{k'} \notin \mathcal{V}(C)$ for some $1 < k' \le k$, $p_{k'}$ is not below any node in $C$, as $N$ is acyclic and
$r_m$ is maximal under $\prec_X$. Hence, there is a path $P$ from $\rho(N)$ to $p_{k'}$ that does not contain any
node in $C$. Since $\ell$ is a descendant of $r_m$, $P$ can be extended into a path from $\rho(N)$ to $\ell$ that
does not contain $\rho(C)$. This contradicts that $\rho(C)$ is visible on $\ell$. Therefore, the parents of
$r_m$ are all in $C$.

(iv) It is derived from the fact that N is acyclic. □ 

Time complexity for finding tree-node decomposition Let $N$ be a binary
reticulation-visible network. Since $N$ is a DAG and has at most $8|\mathcal{L}(N)|$ nodes [6], we can determine
the tree-node components using the breadth-first search technique in $O(|\mathcal{L}(N)|)$ time.
Additionally, a topological ordering of its nodes can also be found in $O(|\mathcal{L}(N)|)$ time. Using
such a topological ordering, we can derive a topological ordering for the big tree-node
components. With this ordering, we can identify a lowest tree-node component described in
Theorem 4.1(iv) in constant time.

For non-binary networks, the above process for determining the big tree-node components
and a lowest one takes $O(|\mathcal{V}(N)| + |\mathcal{E}(N)|)$ time.
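
The decomposition step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the network is assumed to be a dictionary `children` mapping every node to the list of its children, and the sketch relies on Proposition 4.1(a) and Theorem 4.1(ii), so that every tree-node component is rooted either at the network root or at the child of a reticulation node. The function name is hypothetical.

```python
from collections import deque

def tree_node_components(children, root):
    """Return (components, retics), where components maps the root of each
    tree-node component to the set of tree nodes in that component."""
    indeg = {x: 0 for x in children}
    for ys in children.values():
        for y in ys:
            indeg[y] += 1
    retics = {x for x, d in indeg.items() if d > 1}
    # Component roots: the network root and the child of each reticulation.
    comp_roots = [root] + [children[r][0] for r in retics]
    components = {}
    for c in comp_roots:
        nodes, queue = {c}, deque([c])
        while queue:  # breadth-first search that never enters a reticulation
            x = queue.popleft()
            for y in children[x]:
                if y not in retics:
                    nodes.add(y)
                    queue.append(y)
        components[c] = nodes
    return components, retics
```

Every tree node has indegree one, so it is visited exactly once across all searches and the total work is linear in the size of the network. The big components are those whose node set has at least two elements.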

We will use the decomposition theorem to develop fast algorithms for the TCP in
Section 5. The theorem also seems to be very useful for studying the combinatorial aspects of
networks. A network is galled if each reticulation node $r$ has an ancestor $u$ such that two
internal-node disjoint paths exist from $u$ to $r$ in which all nodes except $r$ are tree nodes.
Galled networks are reticulation-visible [12].

Theorem 4.2 For any galled network $N$ with $n$ leaves, $|\mathcal{R}(N)| \le 2(n-1)$.

Proof Let $N$ be a galled network with $n$ leaves. It is reticulation-visible [12]. We consider
the tree-node components of $N$. Since the root of each tree-node component is either the
network root or the unique child of a reticulation,

$|\mathcal{R}(N)| = (\text{no. of tree-node components in } N) - 1.$ (1)

We first show that $N$ does not contain any cross reticulations, that is, reticulation nodes
whose parents are not all in the same tree-node component (see Section 5.1). Suppose on the
contrary $N$ contains a cross reticulation $r$. By the definition of cross reticulation, the parents of $r$
are in different tree-node components. Assume $p_1$ and $p_2$ are two parents of $r$ in different
tree-node components. Since $N$ is acyclic, we may assume that $p_1$ is not below $p_2$. Let $C_{p_i}$
be the tree-node component containing $p_i$ for $i = 1, 2$. We consider the parent $r'$ of $\rho(C_{p_2})$.
$r'$ is a reticulation node. Furthermore, $p_2$ is below $r'$ and hence $r$ is also below $r'$. However,
we can reach $r$ from $p_1$ using a single edge without passing through $r'$, contradicting the
Separation Lemma for galled networks [12, p. 163].

We have proved that $N$ does not contain any cross reticulations. Therefore, all the
tree-node components are connected in a tree structure. Precisely, if $G$ is the graph whose nodes
are the tree-node components, in which a node $X$ is connected to another node $Y$ by a directed
edge if the tree-node components represented by them are separated by a reticulation node
between them, then $G$ is a rooted tree.

Consider a leaf $\ell$ in $G$. If the tree-node component represented by $\ell$ is not a single-leaf
component in $N$, there is no reticulation node below it and thus it must contain a leaf of
$N$. Therefore, $G$ has at most $n$ leaves and thus at most $2n - 1$ nodes. In other words, $N$
contains at most $2n - 1$ tree-node components. By Eqn. (1), $|\mathcal{R}(N)| \le 2(n-1)$. □


5 A Cubic-time Algorithm for the TCP 

In this section, we first present a polynomial-time algorithm for the TCP in the case when the 
given reticulation-visible network is binary. Then, we describe how to modify the algorithm 
into one for non-binary networks. 

5.1 An algorithm for binary networks 

In this subsection, we assume the given network is binary in which a tree node has two 
children and a reticulation node has two parents. We remark that each internal node has 
two children in a phylogenetic tree. 

Let $N$ be a binary reticulation-visible network and $T$ be a tree over the same leaves. A
reticulation node in $N$ is inner if its parents are all in the same tree-node component of $N$.
It is called a cross reticulation otherwise.

By Theorem 4.1, there exists a "lowest" big tree-node component $C$ below which there
are only (if any) single-leaf components (Figure 3). We assume that $C$ contains $k$ network
leaves, say $\ell_1, \ell_2, \ldots, \ell_k$, and there are:

• $m$ inner reticulations $\mathrm{IR}(C) = \{r_1, r_2, \ldots, r_m\}$ and

• $m'$ cross reticulations $\mathrm{CR}(C) = \{r'_1, r'_2, \ldots, r'_{m'}\}$

below $C$. Since $C$ is a big tree-node component, it has two or more nodes, implying that
$k + m + m' \ge 2$.

Let $\rho(C)$ denote the root of $C$. We further define:

$L_C = \{\ell_1, \ell_2, \ldots, \ell_k, c(r_1), c(r_2), \ldots, c(r_m)\}.$ (2)

By Theorem 4.1(iii)(b), $k + m \ge 1$ and so $L_C$ is non-empty. Each path $P$ from $\rho(N)$ to
$c(r_i) \in L_C$ must contain $r_i$. Since the parents of $r_i$ are all in $C$, $P$ must contain $\rho(C)$. Hence,
$\rho(C)$ is visible on each network leaf in $L_C$.

We select an $\ell \in L_C$. Since $T$ has the same leaves as $N$, $\ell \in \mathcal{L}(T)$ and there is a unique
path $P_T$ from $\rho(T)$ to $\ell$ in $T$. Let:

$P_T : v_1, v_2, \ldots, v_t, v_{t+1},$ (3)

where $v_1 = \rho(T)$ and $v_{t+1} = \ell$. Then, $T - P_T$ is a union of $t$ disjoint subtrees $T_1, T_2, \ldots, T_t$,
where $T_i$ is the subtree rooted at the sibling of $v_{i+1}$ for each $i = 1, 2, \ldots, t$ (see Figure 2B).
For the sake of convenience, we consider the single leaf $\ell$ as a subtree, written as $T_{t+1}$. We
now define $s_C$ as:

$s_C = \min\{s \mid \mathcal{L}(T_s) \cap L_C \ne \emptyset\}.$ (4)

Since $\ell \in \mathcal{L}(T_{t+1}) \cap L_C$, $s_C$ is well defined. In the example given in Figure 2, $s_C = 4$.

Proposition 5.1 The index $s_C$ can be computed in $O(|\mathcal{L}(N)|)$ time.

Proof Since $T$ is a binary tree with the same set of labeled leaves as the network $N$, $T$ has
$2|\mathcal{L}(N)| - 1$ nodes and $2|\mathcal{L}(N)| - 2$ edges. For each $x \in \mathcal{V}(T)$, we define a flag variable $f_x$ to
indicate whether the subtree below $x$ contains a network leaf in $L_C$ or not. We first traverse
$T$ in post-order:

• For a leaf $x \in \mathcal{L}(T)$, $f_x = 1$ if $x \in L_C$ and $0$ otherwise.

• For a non-leaf node $x$ with children $y$ and $z$, $f_x = \max\{f_y, f_z\}$.

Then, we compute $s_C$ as $s_C = \min\{i \mid f_{\rho(T_i)} = 1\}$. Clearly, this algorithm correctly computes
$s_C$ in $O(|\mathcal{L}(T)|)$ time. □
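
The procedure in this proof can be sketched in Python as follows. The sketch assumes, for illustration only, that $T$ is stored as a dictionary `children` mapping every node to its list of children (two for internal nodes, none for leaves), that `L_C` is a set of leaf names, and that the selected leaf `leaf` belongs to `L_C`; the function name is hypothetical.

```python
def compute_s_C(children, root, L_C, leaf):
    """Return s_C for the path v_1, ..., v_{t+1} from `root` to `leaf`."""
    # Iterative post-order: flag[x] is True iff a leaf below x lies in L_C.
    flag, parent = {}, {root: None}
    stack = [(root, False)]
    while stack:
        x, children_done = stack.pop()
        if children_done:
            kids = children[x]
            flag[x] = (x in L_C) if not kids else any(flag[y] for y in kids)
        else:
            stack.append((x, True))
            for y in children[x]:
                parent[y] = x
                stack.append((y, False))
    # Recover the path from the root to the selected leaf.
    path = [leaf]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()  # path[i - 1] is v_i
    # T_i is rooted at the sibling of v_{i+1}, for i = 1, ..., t.
    for i in range(1, len(path)):
        v_i, v_next = path[i - 1], path[i]
        sibling = next(y for y in children[v_i] if y != v_next)
        if flag[sibling]:
            return i
    return len(path)  # only T_{t+1}, the leaf itself, meets L_C
```

Both the traversal and the scan along the path touch each node of $T$ a constant number of times, matching the linear bound of the proposition.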


Proposition 5.2 If $N$ displays $T$, then $\mathcal{D}_T(v_{s_C})$ is displayed in $\mathcal{D}_N(\rho(C))$.

Proof When $s_C = t + 1$, the statement is trivial, as $\rho(C)$ is visible on $\ell$ and thus every
path from the network root to $\ell$ must contain $\rho(C)$.

When $s_C < t + 1$, by the definition of $s_C$, there is a network leaf $\ell'$ in $\mathcal{L}(T_{s_C}) \cap L_C$ such
that $\ell' \ne \ell$. If $N$ displays $T$, $T$ has a subdivision $T'$ in $N$. Recall that $\rho(C)$ is visible on
both $\ell$ and $\ell'$. The paths from $\rho(T')$ to $\ell$ and to $\ell'$ in $T'$ must both contain $\rho(C)$. Since $T'$
is a tree, the lowest common ancestor $a(\ell, \ell')$ of $\ell$ and $\ell'$ is a descendant of $\rho(C)$ in $T'$ and it
is the node in $T'$ that corresponds to $v_{s_C}$. Therefore, the subnetwork of $T'$ below $a(\ell, \ell')$ is
a subdivision of $\mathcal{D}_T(v_{s_C})$; that is, $\mathcal{D}_N(\rho(C))$ displays $\mathcal{D}_T(v_{s_C})$. □

If $N$ displays $T$, then $C$ may display more than $\mathcal{D}_T(v_{s_C})$. In other words, it may display
a subtree $\mathcal{D}_T(v_j)$ for some $j < s_C$. We define:

$d_C = \min\{j \mid \mathcal{D}_T(v_j) \text{ is displayed in } \mathcal{D}_N(\rho(C))\}.$ (5)

In the example given in Figure 2, $d_C = 3$.

Proposition 5.3 If $N$ displays $T$, there must be a subdivision $T''$ of $T$ in $N$ such that the
node (in $T''$) corresponding to $v_{d_C}$ is in $C$.

Proof Assume that $N$ displays $T$ via a subdivision $T'$ of $T$. Let $u$ be the node in $T'$ that
corresponds to $v_{d_C}$. Since $\rho(C)$ is visible on the leaf $\ell$, $\rho(C)$ is in the unique path $P$ from
the root to $\ell$ in $T'$. If $u$ is $\rho(C)$ or below it, we are done.

Assume that $u$ is neither $\rho(C)$ nor below $\rho(C)$ in $T'$. Since $\ell$ is a network leaf below $v_{d_C}$
in $T$, $\ell$ is also below $u$ in $T'$. Hence, $P$ must also contain $u$. Since $u$ is not below $\rho(C)$, $\rho(C)$
must be below $u$ in $P$.

On the other hand, by assumption, $\mathcal{D}_T(v_{d_C})$ is displayed in $\mathcal{D}_N(\rho(C))$. It has a subdivision
$T^*$ in $\mathcal{D}_N(\rho(C))$. Let $v_{d_C}$ correspond to $u'$ in $T^*$. It is not hard to see that $u'$ is in the path
from $\rho(C)$ to $\ell$ in $T^*$.

Let $P'$ be the subpath from $u$ to $\rho(C)$ of $P$ and $P''$ be the path from $\rho(C)$ to $u'$ in $C$.
Since the subtree below $u'$ in $T^*$ and the subtree below $u$ in $T'$ have the same set of labeled
leaves as the subtree below $v_{d_C}$ in $T$,

$T'' = T' - \mathcal{D}_{T'}(u) + P' + P'' + \mathcal{D}_{T^*}(u')$

is also a subdivision of $T$ in $N$, in which $v_{d_C}$ is mapped to $u'$ in $C$. Here, $G + H$ is the graph
with the same node set as $G$ and the edge set being the union of $\mathcal{E}(G)$ and $\mathcal{E}(H)$ for graphs
$G$ and $H$ such that $\mathcal{V}(H) \subseteq \mathcal{V}(G)$. □

To compute $d_C$ defined in Eqn. (5), we create a tree $T_C$ from $C$ by attaching two identical
copies of the network leaf below each $r \in \mathrm{IR}(C)$ to its parents in $p(r)$ in $C$ and one copy of
the network leaf below each $r \in \mathrm{CR}(C)$ to the parent in $p(r) \cap \mathcal{V}(C)$. That is, $T_C$ has the node
set:

$\mathcal{V}(T_C) = \mathcal{V}(C) \cup \{x_r, y_r \mid r \in \mathrm{IR}(C)\} \cup \{z_r \mid r \in \mathrm{CR}(C)\},$ (6)

and the edge set

$\mathcal{E}(T_C) = \mathcal{E}(C) \cup \{(u_r, x_r), (v_r, y_r) \mid r \in \mathrm{IR}(C) \text{ with } p(r) = \{u_r, v_r\}\}
\cup \{(u_r, z_r) \mid r \in \mathrm{CR}(C) \text{ with } \{u_r\} = p(r) \cap \mathcal{V}(C)\},$

where $x_r$, $y_r$ and $z_r$ are new leaves with the same label as $c(r)$ for each $r$ in $\mathrm{IR}(C)$ or $\mathrm{CR}(C)$.
$T_C$ is illustrated in Figure 3.

Figure 3: Illustration of a lowest tree-node component $C$ in a binary reticulation-visible
network and the corresponding tree $T_C$ constructed for computing $d_C$ in Proposition 5.4.
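
The construction of $T_C$ in Eqn. (6) and the accompanying edge set can be sketched in Python as follows. The data layout is an assumption made for illustration: `children` and `parents` are dictionaries over the binary network, `comp_nodes` is $\mathcal{V}(C)$ for a lowest big component, and `inner` and `cross` list $\mathrm{IR}(C)$ and $\mathrm{CR}(C)$. The new leaves are represented by the hypothetical tuples `("x", r)`, `("y", r)` and `("z", r)`.

```python
def build_T_C(comp_nodes, children, parents, inner, cross):
    """Return (t_children, label) describing the tree T_C.

    t_children maps each node of T_C to its children in T_C;
    label maps each leaf of T_C to the taxon it carries."""
    # Start from the edges of C itself.
    t_children = {u: [y for y in children[u] if y in comp_nodes]
                  for u in comp_nodes}
    # Network leaves inside C keep their own label.
    label = {u: u for u in comp_nodes if not children[u]}

    def attach(parent, new_leaf, r):
        t_children[parent].append(new_leaf)
        t_children[new_leaf] = []
        label[new_leaf] = children[r][0]  # same label as c(r)

    for r in inner:  # both parents of r are in C: two copies
        u_r, v_r = parents[r]
        attach(u_r, ("x", r), r)
        attach(v_r, ("y", r), r)
    for r in cross:  # exactly one parent of r is in C: one copy
        (u_r,) = [p for p in parents[r] if p in comp_nodes]
        attach(u_r, ("z", r), r)
    return t_children, label
```

Because every non-leaf node of $C$ has two children in $N$, each of which is either in $C$ or a reticulation below $C$, every internal node of the resulting tree has exactly two children.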

Proposition 5.4 There is a dynamic programming algorithm that takes $T_C$ and $T$ as input
and outputs $d_C$ defined in Eqn. (5) in $O(|\mathcal{V}(C)|^2 |\mathcal{V}(T)|)$ time.

Proof It is a special case of a problem studied in [22]. 

For each $r \in \mathrm{IR}(C)$, $T_C$ contains two leaves with the same label as $c(r)$. To detect
whether or not $\mathcal{D}_T(v_j)$ is displayed in $\mathcal{D}_N(\rho(C))$, we have to consider which of these two
leaves will be removed. Such leaves will be referred to as the ambiguous leaves. We use $A(T_C)$
to denote the set of ambiguous leaves in $T_C$.

For each $r \in \mathrm{CR}(C)$, $T_C$ contains one leaf with the same label as $c(r)$. Similar to the
case of ambiguous leaves, this leaf may be removed or kept. Such leaves are called optional
leaves. We use $O(T_C)$ to denote the set of optional leaves in $T_C$.

Since each node in $C$ is a tree node of degree 3 in $N$, $T_C$ is a full binary tree with at most
$2|\mathcal{V}(C)| + 1$ nodes.

For our purpose, we shall present a dynamic programming algorithm to compute the
following set of nodes in $T$:

$S_u = \{x \in \mathcal{V}(T) \mid \mathcal{D}_{T_C}(u) \text{ displays } \mathcal{D}_T(x) \text{ where } u \text{ is mapped to } x\}$

for each node $u$ in $T_C$. Here, that $\mathcal{D}_{T_C}(u)$ displays $\mathcal{D}_T(x)$ means that a node subset $V'$ exists such
that $\mathcal{D}_{T_C}(u) - V'$ is a subdivision of $\mathcal{D}_T(x)$. Since $T_C$ is a tree, $V'$ consists of all the ambiguous
or optional leaves that are in $\mathcal{D}_{T_C}(u)$ but not in $\mathcal{D}_T(x)$ and some ancestors of these leaves.

We introduce a boolean variable $f_{ux}$ to indicate whether or not $\mathcal{D}_{T_C}(u)$ displays $\mathcal{D}_T(x)$
and a set variable $M_{ux}$ to record those removed leaves if so. That is,

$f_{ux} = 1$ if $\mathcal{D}_{T_C}(u)$ displays $\mathcal{D}_T(x)$, and $f_{ux} = 0$ otherwise, (8)

and

$M_{ux} = \mathcal{L}_{T_C}(u) \setminus \mathcal{L}_T(x)$ (9)

if $f_{ux} = 1$, where $\mathcal{L}_Y(z)$ denotes the set of leaves below $z$ in the tree $Y$.

When $u$ is a leaf in $T_C$, we consider whether $x$ is a leaf or not to compute $f_{ux}$.
If $x$ is a leaf with the same label as $u$, then $\mathcal{D}_{T_C}(u)$ displays $\mathcal{D}_T(x)$ and thus we set $f_{ux} = 1$
and $M_{ux} = \emptyset$.

If $x$ is a leaf and its label is different from the label of $u$, $\mathcal{D}_{T_C}(u)$ does not display $\mathcal{D}_T(x)$.
In this case, $f_{ux} = 0$ and $M_{ux}$ is undefined.

If $x$ is not a leaf, it is trivial that $\mathcal{D}_{T_C}(u)$ does not display $\mathcal{D}_T(x)$. Hence, $f_{ux} = 0$ and
$M_{ux}$ is undefined.

When $u$ is an internal node with children $v$ and $w$ in $T_C$, we consider similar cases. If
$x$ is a leaf, $\mathcal{D}_T(x)$ may or may not be displayed in $\mathcal{D}_{T_C}(u)$ if it is displayed below a child of
$u$. If $\mathcal{D}_T(x)$ is displayed below $v$, it is also displayed at $u$ only if every ambiguous leaf in
$M_{vx}$ is not below $w$ and all the leaves below $w$ are either ambiguous or optional. If $\mathcal{D}_T(x)$ is
displayed below neither $v$ nor $w$, it is not displayed at $u$. The remaining cases can be found
in Table 1.

Our dynamic programming algorithm recursively computes $f_{ux}$ for each $u$ and $x$ by
traversing both $T_C$ and $T$ in post-order. For each $u$ and $x$, we compute $f_{ux}$ using the
formulas listed in Table 1. Note that $M_{ux}$ is a subset of $A(T_C) \cup O(T_C)$ and hence has at
most $2|\mathcal{V}(C)|$ elements. Therefore, each recursive step takes $O(|\mathcal{V}(C)|)$ time. Because of
this, the total time taken by the algorithm is $O(|\mathcal{V}(T)| \cdot |\mathcal{V}(C)|^2)$.

After we know the values of $f_{ux}$ for every $u$ in $T_C$ and every $x$ in $T$, we can compute $d_C$
such that $\mathcal{D}_T(v_j)$ is displayed in $\mathcal{D}_N(\rho(C))$ as

$d_C = \min\{j \mid f_{u v_j} = 1 \text{ for some } u \in \mathcal{V}(T_C)\}.$


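□

$x \in \mathcal{L}(T)$?    $f_{vx}$  $f_{wx}$  $f_{ux}$ and $M_{ux}$

Yes               1     1     (a) $f_{ux} = 1$ if $M_{vx} \cap M_{wx} = \emptyset$, and $0$ otherwise.
                              $M_{ux} = M_{vx} \cup M_{wx} \cup \{x\}$ if $f_{ux} = 1$.

Yes               1     0     (b) $f_{ux} = 1$ if $\mathcal{L}(\mathcal{D}_{T_C}(w)) \subseteq A(T_C) \cup O(T_C)$ and
                              $M_{vx} \cap \mathcal{L}(\mathcal{D}_{T_C}(w)) = \emptyset$, and $0$ otherwise.
                              $M_{ux} = M_{vx} \cup \mathcal{L}(\mathcal{D}_{T_C}(w))$ if $f_{ux} = 1$.

Yes               0     1     (c) $f_{ux} = 1$ if $\mathcal{L}(\mathcal{D}_{T_C}(v)) \subseteq A(T_C) \cup O(T_C)$ and
                              $M_{wx} \cap \mathcal{L}(\mathcal{D}_{T_C}(v)) = \emptyset$, and $0$ otherwise.
                              $M_{ux} = M_{wx} \cup \mathcal{L}(\mathcal{D}_{T_C}(v))$ if $f_{ux} = 1$.

Yes               0     0     $f_{ux} = 0$.

No,               1     1     (d) $f_{ux}$ is defined the same as in the leaf case.
$c(x) = \{y, z\}$             $M_{ux} = M_{vx} \cup M_{wx} \cup \mathcal{L}(\mathcal{D}_T(x))$ if $f_{ux} = 1$.

No                1     0     (e) Same as the leaf case.

No                0     1     (f) Same as the leaf case.

No                0     0     (h) $f_{ux} = 1$ if $f_{vy} = 1$, $f_{wz} = 1$ and $M_{vy} \cap M_{wz} = \emptyset$;
                              $f_{ux} = 1$ if $f_{vz} = 1$, $f_{wy} = 1$ and $M_{vz} \cap M_{wy} = \emptyset$;
                              $f_{ux} = 0$ otherwise.
                              $M_{ux} = M_{vy} \cup M_{wz}$ if $f_{vy} = 1$ and $f_{wz} = 1$;
                              $M_{ux} = M_{vz} \cup M_{wy}$ if $f_{vz} = 1$ and $f_{wy} = 1$.

Table 1: The recursive formulas on $f_{ux}$ for different cases when $u$ is an internal node in $T_C$
with children $v$ and $w$.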
Based on the facts presented above, we obtain the following algorithm for the TCP. 
The TCP Algorithm

Input: A reticulation-visible network $N$ and a tree $T$, which are binary.

1. Decompose $N$ into tree-node components: $C_1 \prec C_2 \prec \cdots \prec C_t$,
   where $\prec$ is a topological order such that no directed path
   from a node $u$ in $C_j$ to a node $v$ in $C_i$ exists if $j > i$;

2. $N' \leftarrow N$ and $T' \leftarrow T$;

3. Repeat unless ($N'$ becomes a single node) {

   3.1. Select a lowest big tree-node component $C$;

   3.2. Compute $L_C$ in Eqn. (2) and select $\ell \in L_C$;

   3.3. Compute the path $P_T$ from the root to $\ell$ in Eqn. (3);

   3.4. Determine the smallest index $s_C$ defined by Eqn. (4);

   3.5. Determine the smallest index $d_C$ defined by Eqn. (5);

   3.6. If ($s_C > d_C$), output “$N$ does not display $T$”;
        else {

          For each $r \in CR(C)$ {
            if ($c(r) \notin \mathcal{D}_T(v_{d_C})$), delete $(z, r)$ for $z \in p(r) \cap V(C)$;
            if ($c(r) \in \mathcal{D}_T(v_{d_C})$), delete $(z, r)$ for $z \in p(r) \setminus V(C)$;
          }

          Replace $C$ (resp. $\mathcal{D}_T(v_{d_C})$) by a leaf $\ell_C$ in $N'$ (resp. $T'$);
          Remove $C$ from the list of tree-node components;
          Update $CR(C')$ for the affected big tree-node components $C'$;
        } /* end if */

} /* end repeat */


We now analyze the time complexity of the TCP Algorithm. Note that $N$ has at most $8|\mathcal{L}(N)|$ nodes [6] and the input tree contains $2|\mathcal{L}(N)| - 1$ nodes. Step 1 can be done in $O(|\mathcal{L}(N)|)$ time if breadth-first search is used.

Step 3 is a while-loop. During each execution of this step, the current network is obtained from the previous network by replacing the big tree-node component examined in the last execution with a new leaf node. Because of this, the modification done in the last two lines in Step 3.6 makes the tree decomposition of the current network available before the current execution. Hence, Step 3.1 takes constant time. The time spent in Step 3.2 for each execution is linear in the sum of the numbers of the leaves in $C$ and of the inner reticulations below $C$. Hence, the total time spent in Step 3.2 is $O(|\mathcal{L}(N)|)$, as each reticulation is examined twice at most.

By Proposition 5.1, the total time spent in Step 3.4 is $O(|\mathcal{L}(N)|^2)$. By Proposition 5.4,
the total time spent in Step 3.5 is $\sum_{i} O(|V(C_i)|^2 \, |\mathcal{L}(T)|)$, which is $O(|\mathcal{L}(N)|^3)$. The time spent in Step 3.6 for each execution is $O(|V(C)|)$. Hence, the total time spent in Step 3.6 is $O(|\mathcal{L}(N)|)$. Taken altogether, the above facts imply that the TCP Algorithm takes $O(|\mathcal{L}(N)|^3)$ time, proving Theorem 3.1.
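The cubic bound for Step 3.5 can be checked with a short calculation. The sketch below assumes, as stated above, that $|V(N)| \le 8|\mathcal{L}(N)|$, that $T$ and $N$ have the same leaf set, and that the components $C_i$ can be treated as node-disjoint subsets of $V(N)$. The explicit constant 64 is only illustrative.

```latex
\sum_{i=1}^{t} |V(C_i)|^{2}\,|\mathcal{L}(T)|
  \;\le\; \Bigl(\sum_{i=1}^{t} |V(C_i)|\Bigr)^{2} |\mathcal{L}(T)|
  \;\le\; |V(N)|^{2}\,|\mathcal{L}(N)|
  \;\le\; 64\,|\mathcal{L}(N)|^{3}.
```

The first inequality uses only that a sum of squares of non-negative numbers is at most the square of their sum.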
5.2 Generalization to non-binary networks


The algorithm in the last subsection can be easily modified into a polynomial time TCP algorithm for non-binary reticulation-visible networks in which tree nodes are of indegree 1 and outdegree greater than 1, whereas reticulation nodes are of indegree greater than 1 and outdegree 1. First, we have proved the decomposition theorem for arbitrary reticulation-visible networks. Second, the concepts of inner and cross reticulation remain the same. Third, Propositions 5.1-5.3 still hold. The only modification we have to make is on the definition of $T_C$, appearing in Step 3.5, and on the proof of Proposition 5.4. It can be done as follows.

The node set and the edge set of $T_C$ now become

$$V(T_C) = V(C) \;\cup\; \{ x_r^{(i)} \mid r \in IR(C) \ \&\ 1 \le i \le |p(r)| \} \;\cup\; \{ z_r^{(i)} \mid r \in CR(C) \text{ having } k \text{ parents in } C \ \&\ 1 \le i \le k \}, \tag{10}$$

and

$$\mathcal{E}(T_C) = \mathcal{E}(C) \;\cup\; \{ (p_i, x_r^{(i)}) \mid r \in IR(C) \text{ with } p(r) = \{p_1, \ldots, p_k\},\ 1 \le i \le k \} \;\cup\; \{ (p_i, z_r^{(i)}) \mid r \in CR(C) \text{ with } p(r) \cap V(C) = \{p_1, \ldots, p_k\},\ 1 \le i \le k \}, \tag{11}$$

where $x_r^{(i)}$ and $z_r^{(i)}$ are new leaves with the same label as $c(r)$ for each $r$ in $IR(C)$ or $CR(C)$. We further define:

$$A_r = \{ x_r^{(i)} \mid 1 \le i \le |p(r)| \} \tag{12}$$

for $r \in IR(C)$.

We now describe how to compute $f_{ux}$ when $u$ is a non-leaf node in $N$. (When $u$ is a leaf, we can compute $f_{ux}$ in the same way as in the case when $N$ is binary.)

When $x$ is a leaf in $T$, $f_{ux} = 0$ if $f_{vx} = 0$ for each $v \in c(u)$. Conversely, if $f_{v'x} = 1$ for some $v' \in c(u)$, $f_{ux} = 1$ only if

(i) $\mathcal{L}(\mathcal{D}_{T_C}(v)) \subseteq \mathcal{A}(T_C) \cup \mathcal{O}(T_C)$ for each $v \ne v'$ in $c(u)$, and

(ii) $A_r \not\subseteq M_{v'x} \cup \bigl[\bigcup_{v \in c(u) \setminus \{v'\}} \mathcal{L}(\mathcal{D}_{T_C}(v))\bigr]$ for each $r \in IR(C)$, where $M_{v'x}$ is defined as in the binary case.

The reason for (i) is that if the display of $\mathcal{D}_T(x)$ is extended from the subtree rooted at $v'$ to the subtree rooted at $u$, we have to delete the subtrees rooted at any other child of $u$. The reason for (ii) is that we cannot delete all the leaves added for each $r \in IR(C)$.

It is possible that $f_{vx} = 1$ for different children $v$ of $u$. However, we would like to point out that whether conditions (i) and (ii) hold or not is independent of which child is selected when this happens.

When $x$ is not a leaf, we assume $c(x) = \{y, z\}$. Clearly, if $f_{vx} = 0$ for each $v \in c(u)$ and no $v'$ and $v''$ exist in $c(u)$ such that $f_{v'y} = 1$ and $f_{v''z} = 1$, then $f_{ux} = 0$.

If $f_{v'x} = 1$ for some $v' \in c(u)$, we can determine whether or not $f_{ux} = 1$ in the same way as in the case when $x$ is a leaf.

If $f_{v'y} = 1$ and $f_{v''z} = 1$ for $v', v'' \in c(u)$ such that $v' \ne v''$, $f_{ux} = 1$ only if

(i) $\mathcal{L}(\mathcal{D}_{T_C}(v)) \subseteq \mathcal{A}(T_C) \cup \mathcal{O}(T_C)$ for each $v \in c(u)$ such that $v' \ne v \ne v''$, and

(ii) $A_r \not\subseteq M_{v'y} \cup M_{v''z} \cup \bigl[\bigcup_{v \in c(u) \setminus \{v', v''\}} \mathcal{L}(\mathcal{D}_{T_C}(v))\bigr]$ for each $r \in IR(C)$.

Again, the two conditions are independent of which $v'$ and $v''$ are selected.

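To efficiently check conditions (i) and (ii), we introduce some integer variables for each node. For each $r \in IR(C)$ and each node $u$ in $T_C$, $m_{ur}$ denotes the number of leaves in $\mathcal{L}(\mathcal{D}_{T_C}(u))$ that are in $A_r$; $m_{u0}$ denotes the number of non-ambiguous and non-optional leaves in $\mathcal{L}(\mathcal{D}_{T_C}(u))$. It is not hard to see that $m_{u0}$ can be recursively computed using

$$m_{u0} = \begin{cases} 1 & \text{if } u \text{ is a leaf in neither } \mathcal{A}(T_C) \text{ nor } \mathcal{O}(T_C), \\ 0 & \text{if } u \text{ is a leaf in } \mathcal{A}(T_C) \text{ or } \mathcal{O}(T_C), \\ \sum_{v \in c(u)} m_{v0} & \text{if } u \text{ is a non-leaf node.} \end{cases} \tag{13}$$

Similarly, $m_{ur}$ can be updated as follows:

$$m_{ur} = \begin{cases} 1 & \text{if } u \text{ is a leaf in } A_r, \\ 0 & \text{if } u \text{ is a leaf not in } A_r, \\ \sum_{v \in c(u)} m_{vr} & \text{if } u \text{ is a non-leaf node.} \end{cases} \tag{14}$$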


The subset relation in condition (i) is equivalent to $m_{v0} = 0$. Note that $\mathcal{L}(\mathcal{D}_{T_C}(u)) \cap A_r = \bigl[M_{v'y} \cup M_{v''z} \cup \bigl(\bigcup_{v \in c(u) \setminus \{v', v''\}} \mathcal{L}(\mathcal{D}_{T_C}(v))\bigr)\bigr] \cap A_r$ for each $r \in IR(C)$ such that its leaf child $c(r)$ is not in $\mathcal{L}(\mathcal{D}_T(x))$. As condition (ii) is clearly true if $c(r)$ is in $\mathcal{L}(\mathcal{D}_T(x))$, it is equivalent to $m_{ur} < |A_r|$ for each $r \in IR(C)$ such that $c(r)$ is not in $\mathcal{L}(\mathcal{D}_T(x))$. Thus, we can update $f_{ux}$ using the formulas in Table 2.

Using Eqn. (13) and (14), we can pre-compute $m_{u0}$ and $m_{ur}$ for all $r \in IR(C)$ in $|\mathcal{E}(T_C)| \cdot (|IR(C)| + 1)$ time.
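The recurrences in Eqn. (13) and (14) amount to one post-order pass over $T_C$. The following is a minimal sketch under assumed data structures: `children` maps each node to its list of children, `ambiguous_or_optional` is the set of leaves in $\mathcal{A}(T_C) \cup \mathcal{O}(T_C)$, and `A` maps each inner reticulation $r$ to its set $A_r$. These names are illustrative and do not come from the paper.

```python
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable


def leaf_counters(
    children: Dict[Node, List[Node]],
    root: Node,
    ambiguous_or_optional: Set[Node],
    A: Dict[Hashable, Set[Node]],
) -> Tuple[Dict[Node, int], Dict[Node, Dict[Hashable, int]]]:
    """Compute m0[u] (Eqn. 13) and mr[u][r] (Eqn. 14) for every node below root."""
    m0: Dict[Node, int] = {}
    mr: Dict[Node, Dict[Hashable, int]] = {}

    # Pre-order listing; its reverse visits every child before its parent.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    for u in reversed(order):
        kids = children.get(u, [])
        if not kids:  # u is a leaf of T_C
            m0[u] = 0 if u in ambiguous_or_optional else 1
            mr[u] = {r: int(u in leaves) for r, leaves in A.items()}
        else:  # u is a non-leaf node: sum over its children
            m0[u] = sum(m0[v] for v in kids)
            mr[u] = {r: sum(mr[v][r] for v in kids) for r in A}
    return m0, mr
```

Each node does work proportional to its number of children times $|IR(C)| + 1$, which matches the pre-computation cost stated above.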

When $f_{ux}$ is updated, we need to check whether or not $f_{vx} = 1$ and $f_{vy} = 1$ for each $y \in c(x)$ and each child $v \in c(u)$. This can be done in $O(|c(u)|)$ time. Similarly, condition (i) is independent of $x$ and can be checked in $O(|c(u)|)$ time; condition (ii) can be checked in $O(|IR(C)|)$ time. Hence, for each node $x$ in $T$, updating $f_{ux}$ over all nodes $u$ of $T_C$ takes $O\bigl(\sum_{u \in V(T_C)} (|c(u)| + |IR(C)|)\bigr) = O(|\mathcal{E}(T_C)| + |IR(C)| \cdot |V(T_C)|)$ time. This implies that the run time on $T_C$ for all nodes in $T$ is $O(|V(T)| \cdot [\,|\mathcal{E}(T_C)| + |IR(C)| \cdot |V(T_C)|\,])$.

Therefore, the total run time for determining whether or not $T$ is displayed in $N$ is $O(|V(T)| \cdot [\,|\mathcal{E}(N)| + |\mathcal{R}(N)| \cdot |V(N)|\,]) = O(|V(T)| \cdot |V(N)| \cdot |\mathcal{R}(N)|)$ time. In general, $|\mathcal{E}(N)|$ is not bounded by any function linear in the number of leaves in an arbitrary network.


6 A Linear-time Algorithm for the CCP 

As another application of the Decomposition Theorem, we shall design a linear-time algorithm for the CCP. We first present the desired algorithm for binary reticulation-visible networks and then generalize it to non-binary networks.

6.1 Algorithm for binary networks

Given a binary reticulation-visible network $N$ and a subset $B \subseteq \mathcal{L}(N)$, the goal is to determine whether or not $B$ is a cluster of some node in a tree displayed by $N$.
$x \in \mathcal{L}(T)$? | $f_{ux}$ and $M_{ux}$

Yes | Case 1: $f_{vx} = 0$ for each $v \in c(u)$:
        $f_{ux} = 0$;
      Case 2: $f_{v'x} = 1$ for some $v' \in c(u)$:
        If (i) $m_{v0} = 0$ for each $v \ne v'$ in $c(u)$, and
           (ii) $m_{ur} < |A_r|$ for each $r \in IR(C)$ such that $A_r \cap \mathcal{L}(\mathcal{D}_T(x)) = \emptyset$, {
          $f_{ux} = 1$;
        } else $f_{ux} = 0$;

No, $c(x) = \{y, z\}$ | Case 1: $f_{vx} = 0$ for any $v \in c(u)$, and no $v'$ and $v''$ in $c(u)$ exist such that $f_{v'y} = 1$ and $f_{v''z} = 1$:
        $f_{ux} = 0$;
      Case 2: $f_{v'x} = 1$ for some $v' \in c(u)$:
        Use the same update rule as in Case 2 when $x$ is a leaf;
      Case 3: $f_{v'y} = 1$ and $f_{v''z} = 1$ for some $v' \ne v''$ in $c(u)$:
        If (i) $m_{v0} = 0$ for each $v \in c(u)$ such that $v' \ne v \ne v''$, and
           (ii) $m_{ur} < |A_r|$ for each $r \in IR(C)$ such that $A_r \cap \mathcal{L}(\mathcal{D}_T(x)) = \emptyset$, {
          $f_{ux} = 1$;
        } else $f_{ux} = 0$;

Table 2: The update rules for computing $f_{ux}$ for a node $u$ in an arbitrary network.
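The update rules of Table 2 can be sketched as a single function. This is only an illustration under stated assumptions: the counters $m_{v0}$ and $m_{ur}$ are pre-computed, the caller supplies the set `relevant` of inner reticulations $r$ with $A_r \cap \mathcal{L}(\mathcal{D}_T(x)) = \emptyset$, and Case 2 is given priority over Case 3, which is one reading of the table. All parameter names are hypothetical. The pair search in Case 3 is written naively and does not reproduce the $O(|c(u)|)$ bound discussed in the text.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Set


def update_f(
    kids: List[Hashable],
    f_x: Dict[Hashable, int],
    m0: Dict[Hashable, int],
    mur: Dict[Hashable, int],
    A_size: Dict[Hashable, int],
    relevant: Iterable[Hashable],
    f_y: Optional[Dict[Hashable, int]] = None,
    f_z: Optional[Dict[Hashable, int]] = None,
) -> int:
    """Return f_ux for an internal node u whose children are `kids`.

    f_x[v], f_y[v], f_z[v] hold f_vx, f_vy, f_vz for each child v;
    m0[v] is m_{v0}; mur[r] is m_{ur}; A_size[r] is |A_r|.
    Pass f_y and f_z only when x is an internal node with children y, z.
    """
    relevant = list(relevant)

    def conditions(kept: Set[Hashable]) -> bool:
        # (i): every child outside `kept` has only ambiguous/optional leaves.
        others_deletable = all(m0[v] == 0 for v in kids if v not in kept)
        # (ii): not all copies of a relevant ambiguous leaf lie below u.
        copy_survives = all(mur[r] < A_size[r] for r in relevant)
        return others_deletable and copy_survives

    # Case 2: some child already displays D_T(x).
    for v in kids:
        if f_x[v] == 1:
            return int(conditions({v}))

    # Case 3: two distinct children display D_T(y) and D_T(z).
    if f_y is not None and f_z is not None:
        for v1 in kids:
            for v2 in kids:
                if v1 != v2 and f_y[v1] == 1 and f_z[v2] == 1:
                    return int(conditions({v1, v2}))

    return 0  # Case 1
```

Returning on the first qualifying child (or pair) relies on the remark in the text that conditions (i) and (ii) do not depend on which child is selected.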


Assume $N$ has $t$ big tree-node components $C_1, C_2, \ldots, C_t$. Consider a lowest big tree-node component $C$. We use the same notation as in the TCP algorithm: $L_C$ is defined as before; $\rho(C)$ denotes the root of $C$; $IR(C)$ and $CR(C)$ denote the sets of inner and cross reticulations below $C$, respectively. We also set $\bar{B} = \mathcal{L}(N) \setminus B$.

When $L_C \cap B \ne \emptyset$ and $L_C \cap \bar{B} \ne \emptyset$, $L_C$ contains $\ell_1$ and $\ell_2$ such that $\ell_1 \in B$, but $\ell_2 \notin B$. If $B$ is the cluster of a node $z$ in a subtree $T'$ of $N$, $z$ is in the unique path $P$ from $\rho(T')$ ($= \rho(N)$) to $\ell_1$ in $T'$.

Assume $z$ is between $\rho(N)$ and $\rho(C)$ in $P$. No matter which incoming edge is contained in $T'$ for each $r \in IR(C)$, $\ell_2$ is below $\rho(C)$, as $\rho(C)$ is visible on $\ell_2$. This implies that $\ell_2$ is below $z$ and thus in $B$, a contradiction. Therefore, if $B$ is a soft cluster, it must be a soft cluster of a node in $C$.

When $L_C \cap \bar{B} = \emptyset$ (that is, $L_C \subseteq B$), we define

$$X = \{ r \in CR(C) \mid c(r) \notin B \}. \tag{15}$$

Construct a subtree $T'$ of $\mathcal{D}_N(\rho(C))$ by deleting:

• all but one of the incoming edges for each $r \in IR(C)$,

• all incoming edges but one with a tail not in $C$ for each $r \in X$, and

• all incoming edges but one with a tail in $C$ for each $r \in CR(C) \setminus X$.
We then define

$$\hat{B} = L_C \cup \{ c(r) \mid r \in CR(C) \setminus X \}. \tag{16}$$

It is not hard to see that $\hat{B}$ is the cluster of the root of $T'$ such that $\hat{B} = B \cap \mathcal{L}(\mathcal{D}_N(\rho(C))) \subseteq B$. Hence, if $\hat{B} = B$, then $B$ is a cluster contained in $\mathcal{D}_N(\rho(C))$. If $\hat{B} \ne B$, we reconstruct $N'$ from $N$ by:

• removing all the edges in $\{(u, r) \in \mathcal{E}(N) \mid r \in X, u \in V(C)\}$,

• removing all the edges in $\{(u, r) \in \mathcal{E}(N) \mid r \in IR(C) \cup CR(C) \setminus X, u \notin V(C)\}$,

• removing all but one of the edges in $\{(u, r) \in \mathcal{E}(N) \mid r \in IR(C) \cup CR(C) \setminus X, u \in V(C)\}$, and

• replacing $\mathcal{D}_N(\rho(C))$ by a new leaf $\ell_C$,

and set $B' = \{\ell_C\} \cup B \setminus \hat{B}$. We have the following fact.


Proposition 6.1 If $L_C \cap \bar{B} = \emptyset$, $B$ is contained in $N$ if and only if $B'$ is contained in $N'$.

Proof Recall that $B' = (B \cup \{\ell_C\}) \setminus \hat{B}$. Assume $B'$ is the cluster of a node $z$ in a tree $T''$ displayed in $N'$. When $N'$ was reconstructed, $\ell_C$ replaced the subtree $T'$ rooted at $\rho(C)$ whose leaves are $\hat{B}$; so if we re-expand $\ell_C$ into $T'$, the cluster of $z$ becomes $(B \setminus \hat{B}) \cup \hat{B} = B$, thus $B$ is contained in $N$.

Assume $B$ is the cluster of a node $z$ in a subtree $T$ displayed in $N$. Let $E$ be the set of edges entering the reticulation nodes such that $T = N - E$. Since $B \ne \hat{B} = B \cap \mathcal{L}(\mathcal{D}_N(\rho(C)))$, there is a leaf $\ell \in B$ not below $\rho(C)$. Since $\ell$ is below $z$ in $T$, $z$ must be above $\rho(C)$ in $T$.

Consider a reticulation node $r \in CR(C) \setminus X$. Since $c(r)$ is a leaf in $B$, it is a leaf below $z$ in $T$. By definition of cross reticulation, $r$ has at least one parent in $C$. Let $(p_C, r)$ be an edge such that $p_C \in C$. Note that all but one of the incoming edges of $r$ are in $E$. Define

$$E' = \bigl[E \setminus \{(p_C, r) \mid r \in CR(C) \setminus X\}\bigr] \cup \{(p, r) \in \mathcal{E}(N) \mid r \in CR(C) \setminus X \ \&\ p \notin V(C)\}.$$

It is not hard to see that $(p_C, r)$ is the unique incoming edge of $r$ not in $E'$ for each cross reticulation $r \notin X$.

Let $T' = N - E'$. It is easy to see that the cluster of $z$ is equal to $B$ and $\hat{B}$ is the cluster of $\rho(C)$ in $T'$. Therefore, if we contract the subtree below $\rho(C)$ into a single leaf $\ell_C$, the cluster of $z$ becomes $(B \cup \{\ell_C\}) \setminus \hat{B}$, which is $B'$. Therefore, $B'$ is contained in $N'$. □
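The set manipulations behind Proposition 6.1 (Eqn. (15), Eqn. (16) and the definition of $B'$) can be sketched with ordinary set operations. The representation below is an assumption made for illustration: leaves are hashable labels, `cross_child` maps each cross reticulation $r \in CR(C)$ to its leaf child $c(r)$, and `new_leaf` stands for $\ell_C$. The edge removals that produce $N'$ are not modelled.

```python
from typing import Dict, Hashable, Optional, Set, Tuple


def reduce_cluster(
    B: Set[Hashable],
    L_C: Set[Hashable],
    cross_child: Dict[Hashable, Hashable],
    new_leaf: Hashable,
) -> Tuple[Set[Hashable], Set[Hashable], Optional[Set[Hashable]]]:
    """Return (X, B_hat, B_prime) for the case L_C ⊆ B.

    B_prime is None when B_hat == B, i.e. B is already a cluster below rho(C).
    """
    if not L_C <= B:
        raise ValueError("this reduction assumes every leaf of L_C lies in B")
    # Eqn. (15): cross reticulations whose leaf child is outside B.
    X = {r for r, leaf in cross_child.items() if leaf not in B}
    # Eqn. (16): L_C together with the children of the remaining cross reticulations.
    B_hat = set(L_C) | {leaf for r, leaf in cross_child.items() if r not in X}
    if B_hat == B:
        return X, B_hat, None
    # B' = {l_C} ∪ (B \ B_hat): the contracted subtree is represented by one leaf.
    B_prime = ({new_leaf} | B) - B_hat
    return X, B_hat, B_prime
```

For example, with $B = \{1, 2, 3, 5\}$, $L_C = \{1, 2\}$ and two cross reticulations whose children are 3 and 4, the sketch yields $\hat{B} = \{1, 2, 3\}$ and $B' = \{\ell_C, 5\}$.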


When $B \cap L_C = \emptyset$, $B$ may or may not be contained in $\mathcal{D}_N(\rho(C))$. If it is not, we use $X$ defined in Eqn. (15) to reconstruct $N'$ from $N$ by:

• removing all the edges in $\{(u, r) \in \mathcal{E}(N) \mid r \in CR(C) \text{ s.t. } c(r) \notin B, u \notin V(C)\}$,

• removing all the edges in $\{(u, r) \in \mathcal{E}(N) \mid r \in CR(C) \text{ s.t. } c(r) \in B, u \in V(C)\}$, and

• replacing $\mathcal{D}_N(\rho(C))$ by a new leaf $\ell_C$.

Similar to the last case, we have the following fact.


Proposition 6.2 If $B$ is not in $\mathcal{D}_N(\rho(C))$ and $L_C \cap B = \emptyset$, it is a soft cluster in $N$ if and only if $B$ is a soft cluster in $N'$.
Proof $N'$ is a subnetwork of $N$. If $B$ is a soft cluster in $N'$, it is a soft cluster in $N$.

Conversely, assume $B$ is the cluster of a node $z$ in a subtree $T$ of $N$. Let $E$ be the set of reticulation edges such that $T = N - E$. By assumption, $B$ is not contained in $\mathcal{D}_N(\rho(C))$. Since $\rho(C)$ is visible on all leaves in $L_C$, $\rho(C)$ is not below $z$ in $T$. Therefore, $\rho(C)$ and $z$ do not have an ancestral relationship.

Consider a reticulation $r \in CR(C)$. Since $r$ is a cross reticulation, it has a parent $p_r$ in $C$. If $c(r) \in B$, $c(r)$ is a leaf below $z$ in $T$ and thus the unique incoming edge not in $E$ has a tail not in $C$. If $c(r) \notin B$, the unique incoming edge not in $E$ may or may not have a tail in $C$.

We define:

$$E' = \bigl(E \cup \{(p, r) \mid r \in CR(C) \text{ s.t. } c(r) \notin B, p \notin C\}\bigr) \setminus \{(p_r, r) \mid r \in CR(C) \text{ s.t. } c(r) \notin B\}.$$

Note that $(p_r, r)$ is the unique incoming edge of $r$ not in $E'$ for each $r \in CR(C)$ such that $c(r) \notin B$. Let $T' = N - E'$. It is easy to see that the cluster of $z$ in $T'$ remains the same as the cluster of $z$ in $T$, which is equal to $B$. If we contract $\mathcal{D}_{T'}(\rho(C))$ into a single leaf $\ell_C$, $T'$ is a subtree of $N'$, implying that $B$ is a soft cluster in $N'$. □

We next show that whether or not $B$ is in $C$ can be determined in linear time.

Proposition 6.3 Let $T_C$ be the tree constructed from $C$ as defined for the TCP algorithm. An algorithm exists that takes $T_C$ and $B$ as inputs and determines whether or not $B$ is a soft cluster in $C$ in $O(|\mathcal{E}(T_C)|)$ time.

Proof First, we check whether or not each of the leaves in $B$ is below $\rho(C)$. If $B \not\subseteq \mathcal{L}(\mathcal{D}_N(\rho(C)))$, $B$ is not a soft cluster in $\mathcal{D}_N(\rho(C))$. We assume that $|B| \ge 2$ and all its leaves are found below $\rho(C)$.

Assume $B$ is a cluster of a node $v$ in a subtree $T$ of $\mathcal{D}_N(\rho(C))$. Clearly, any network leaf in $C$ is not below $v$ if it is not in $B$. For each $r \in IR(C)$, $p' \in p(r)$ exists such that $(p', r)$ was removed. If $c(r) \notin B$, the parent of $r$ other than $p'$ must not be below $v$. Since $B$ is not a singleton, $v$ is an internal node in $T_C$. Taken together, these facts imply that $v$ satisfies the following properties:

(i) Every leaf in $B$ is below $v$.

(ii) If a leaf below $v$ is neither ambiguous nor optional, it must be in $B$.

(iii) If the ambiguous leaves introduced for an $r \in IR(C)$ are both below $v$, then $c(r) \in B$.

Conversely, if an internal node $v$ of $T_C$ satisfies the above three properties, we can then construct a subtree $T$ of $N$ in which $B$ is the cluster of $v$ as follows:

• For each $r \in IR(C) \cup CR(C)$ below $v$, if it has a parent $p_1$ below $v$ and another parent $p_2$ not below $v$, remove $(p_2, r)$ if $c(r)$ is in $B$ and remove $(p_1, r)$ otherwise.

• For any other reticulation node not below $v$, choose an arbitrary incoming edge to remove.

It is easy to verify that $B$ is a cluster of $v$ in the resulting tree.
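Properties (i)-(iii) can be tested directly for a single candidate node. The following brute-force checker is a sketch under assumed data structures, not the linear-time procedure of Algorithm 1: `children` describes $T_C$, `label` maps each leaf of $T_C$ to its taxon, `ambiguous` maps each inner reticulation $r$ to the set of ambiguous leaves added for it (all labelled $c(r)$), and `optional` is the set of optional leaves. Membership in $B$ is tested by label.

```python
from typing import Dict, Hashable, List, Set

Node = Hashable


def leaves_below(children: Dict[Node, List[Node]], v: Node) -> List[Node]:
    """Collect the leaves of the subtree rooted at v."""
    stack, out = [v], []
    while stack:
        u = stack.pop()
        kids = children.get(u, [])
        if kids:
            stack.extend(kids)
        else:
            out.append(u)
    return out


def satisfies_properties(
    children: Dict[Node, List[Node]],
    v: Node,
    label: Dict[Node, Hashable],
    B: Set[Hashable],
    ambiguous: Dict[Hashable, Set[Node]],
    optional: Set[Node],
) -> bool:
    below = set(leaves_below(children, v))
    # (i) every leaf of B has a copy below v.
    if not B <= {label[leaf] for leaf in below}:
        return False
    # (ii) an ordinary (neither ambiguous nor optional) leaf below v must be in B.
    all_ambiguous = set().union(*ambiguous.values())
    for leaf in below:
        if leaf not in all_ambiguous and leaf not in optional and label[leaf] not in B:
            return False
    # (iii) if all copies added for r are below v, then c(r) must be in B.
    for copies in ambiguous.values():
        if copies and copies <= below and label[next(iter(copies))] not in B:
            return False
    return True
```

Calling this checker on every internal node of $T_C$ would decide the question in quadratic time; the point of Algorithm 1 is to avoid that by walking a single root-to-leaf path.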

Assume $B$ is a cluster of an internal node $v$ in $\mathcal{D}_N(\rho(C))$ and $\ell \in B$. If $\ell$ is not the child of an inner reticulation node below $C$, then $v$ is contained in the path from $\rho(C)$ to $\ell$ in $T_C$. Otherwise, $v$ is contained in the path from $\rho(C)$ to one of the ambiguous leaves defined for $\ell$ in $T_C$.

Based on the above facts, we obtain Algorithm 1 (Table 3) to determine whether $B$ is a cluster in $C$.

The correctness of the algorithm follows from the following facts. When the algorithm stops with answer “No” during the traversal of the subtree branching off at $u$ in the path from $\rho(T_C)$ to the leaf $x$, it means that two ambiguous copies of a leaf not in $B$ have been seen in the subtree below $u$. This implies that for any descendant $w$ of $u$, not all leaves in $B$ have been seen below $w$, and for $u$ and each of its ancestors, some leaf not in $B$ has two ambiguous copies below it. Hence, no node exists in the path from the root to $x$ in $T_C$ that satisfies the properties (i)-(iii). When the algorithm exits at Step 4, $B$ is clearly not a soft cluster in $C$.

When the algorithm stops with answer “Yes” at $u$ inside Step 3.4, $u$ is the lowest node satisfying the three properties. Hence, $B$ is a soft cluster in $C$.

Next, we analyze the complexity of Algorithm 1. Step 1 and Step 2 can be done in $O(|\mathcal{L}(T_C)|)$ time. Since there are at most two leaves with the same label as $\ell$ in $T_C$, the outer


Algorithm 1

Input: $T_C$ and a subset $B$ of leaves in $\mathcal{D}_N(\rho(C))$;

1. If $|B| = 1$, output “Yes” and exit;

2. Set $k = |B|$ and $m = |\mathcal{L}(\mathcal{D}_N(\rho(C)))| - k$;

   Set a $k$-tuple $Y[1..k]$; /* record if a leaf in $B$ has been seen */

   Set an $m$-tuple $Z[1..m]$;

   /* Use $Z$ to record the copies of $\ell \notin B$ that have not been seen so far */

3. Select $\ell \in B$.

   for each leaf $x$ in $T_C$ that has the same label as $\ell$ {

     3.1 $u = x$; $f = 1$; /* $f$ is the no. of leaves in $B$ that have been seen so far */

     3.2 For each $i$ from 1 to $k$, $Y[i] = 0$;

     3.3 For each $i$ from 1 to $m$
           if (the $i$-th leaf $\ell' \notin B$ is ambiguous or optional) $Z[i] = 2$ else $Z[i] = 1$;

     3.4 Repeat while $u \ne \rho(C)$ {

           $v' = u$; $v'' =$ (the sibling of $u$); $u = p(u)$;

           for each leaf $\ell''$ in the subtree rooted at $v''$ { /* Traverse the subtree of $T_C$ rooted at $v''$ */
             if ($\ell'' \in B$ having rank $j$) & ($Y[j] == 0$) { $Y[j] = 1$; $f = f + 1$ };
             if ($\ell'' \notin B$ having rank $j$) {
               $Z[j] = Z[j] - 1$; if ($Z[j] == 0$) stop Step 3.4;
             }
           } /* end for */

           if ($f == |B|$) output “Yes” and exit;

         } /* end repeat */

   } /* end the outer for */

4. output “No” and exit;

Table 3: An algorithm to decide whether a leaf subset $B$ is a soft cluster in $C$.
for-loop will execute twice at most. At each execution of the for-loop, Steps 3.1-3.3 take $O(|\mathcal{L}(T_C)|)$ time. In Step 3.4, we may traverse different subtrees of $T_C$ that branch off at a node in the path from the root of $T_C$ to $x$, so the total running time for Step 3.4 is $O(|\mathcal{E}(T_C)|)$. Hence, the algorithm runs in $O(|\mathcal{L}(T_C)| + |\mathcal{E}(T_C)|) = O(|V(C)|)$ time, as $|\mathcal{E}(T_C)| \le 3|V(C)|$ in a binary network.

□ 

Taking all the above facts together, we are able to give a linear time algorithm for the 
CCP. 

The CCP Algorithm

Input: A binary network $N$ and a subset $B \subseteq \mathcal{L}(N)$;

1. Compute the big tree-node components sorted in a topological order:
   $C_1 \prec C_2 \prec \cdots \prec C_t$;

2. for $k = 1$ to $t$ do {

   2.1. Set $C = C_k$; compute $L := L_{C_k}$, defined in Eqn. (1);

   2.2. $Y :=$ (Is $B$ contained in $\mathcal{D}_N(\rho(C))$?);

   2.3. if ($Y == 1$) output “Yes” and exit;

   2.4. if ($Y == 0$) {

          $\bar{B} := \mathcal{L}(N) \setminus B$;

          if ($L \cap B \ne \emptyset$ & $\bar{B} \cap L \ne \emptyset$) output “No” and exit;

          if ($B \cap L == \emptyset$) {
            Remove edges in $\{(u, r) \mid r \in CR(C) \text{ s.t. } c(r) \notin B, u \notin C\}$;
            Remove edges in $\{(u, r) \mid r \in CR(C) \text{ s.t. } c(r) \in B, u \in C\}$;
          }

          if ($\bar{B} \cap L == \emptyset$) {
            Remove edges in $\{(u, r) \mid r \in CR(C) \text{ s.t. } c(r) \notin B, u \in C\}$;
            Remove edges in $\{(u, r) \mid r \in CR(C) \text{ s.t. } c(r) \in B, u \notin C\}$;
            $B := (B \cup \{\ell_C\}) \setminus (L \cup \{c(r) \mid r \in CR(C) \text{ s.t. } c(r) \in B\})$;
          }

          Replace $\mathcal{D}_N(\rho(C))$ by a leaf $\ell_C$;
          Remove $C$ from the list of big tree-node components;
          Update $CR(C')$ for affected big tree-node components $C'$;
        }

} /* end for */

The obtained CCP algorithm runs in linear time. Step 1 takes $O(|V(N)|) = O(|\mathcal{L}(N)|)$ time, as $N$ is binary. Step 2 is a for-loop that runs $t$ times. Since the total number of the network leaves in and the reticulation nodes below $C_k$ is at most $3|V(C_k)|$, Step 2.1 takes $O(|V(C_k)|)$ time for each execution. In Step 2.2, the linear-time Algorithm 1 is called to compute $Y$ in $O(|V(C_k)|)$ time. Obviously, Step 2.3 takes constant time. To implement Step 2.4 in linear time, we need to use an array $A$ to indicate whether a network leaf is in $B$ or not. $A$ can be constructed in $O(|\mathcal{L}(N)|)$ time. With $A$, each conditional clause in Step 2.4 can be determined in $|L|$ time, which is at most $O(|V(C_k)|)$. Since the total number of inner and cross reticulations is at most $2|V(C_k)|$, each line of Step 2.4 takes at most $O(|V(C_k)|)$ time. Hence, Step 2.4 still takes $O(|V(C_k)|)$ time. Taking all these together, we have that the total time taken by Step 2 is $\sum_{k=1}^{t} O(|V(C_k)|) = O(|\mathcal{L}(N)|)$. Therefore, Theorem 3.2 is proved.


6.2 Generalization of the CCP algorithm to non-binary networks 

Propositions 6.1 and 6.2 have been proved for non-binary reticulation-visible networks. The straightforward generalization of Algorithm 1 does not give a linear-time algorithm for determining whether a subset $B$ of leaves is a soft cluster in $\mathcal{D}_N(\rho(C))$ for non-binary networks, as the outer for-loop in Step 3 will run $k$ times if $T_C$ contains $k$ ambiguous/optional leaves that have the same label as the selected leaf in $B$. However, it is widely known that the lowest common ancestor (lca) of any two nodes in a tree can be computed in $O(1)$ time after a linear-time pre-processing. In the rest of this subsection, we will use this result to prove Proposition 6.3 for non-binary networks.

Given a non-binary reticulation-visible network $N$ and a $B \subseteq \mathcal{L}(N)$ such that $|B| > 1$, we work on a lowest big tree-node component $C$ of $N$. Let $T_C$ be the tree defined in Eqn. (10) and (11). For each $r \in IR(C)$, $A_r$ denotes the set of ambiguous leaves defined in Eqn. (12) and lca($r$) denotes the lca of the leaves in $A_r$.

Proposition 6.4 All the nodes in $V_{lca} = \{\mathrm{lca}(r) \mid r \in IR(C)\}$ can be computed in $O(|\mathcal{E}(N)|)$ time.

Proof We first pre-process $T_C$ in $O(|\mathcal{E}(T_C)|)$ time so that for any two nodes $u$ and $v$ in $T_C$, lca($u, v$) can be found in $O(1)$ time [1, 10].

Initially, each lca node is undefined. We visit all leaves in $T_C$ in a depth-first manner. When visiting a leaf $\ell$ that is ambiguous and added for $r \in IR(C)$, we set lca($r$) $= \ell$ if lca($r$) is undefined, and lca($r$) $=$ lca(lca($r$), $\ell$) otherwise. Since each lca operation takes $O(1)$ time, the whole process takes $O(|\mathcal{E}(N)|)$ time. □
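The folding step in this proof can be sketched as follows. The sketch deliberately swaps in a naive parent-pointer lca, which takes time proportional to the tree depth per query, as a stand-in for the constant-time structures of [1, 10]; only the folding logic is the same. The dictionaries `parent`, `depth` and `added_for` (mapping each ambiguous leaf to its inner reticulation) are assumed inputs with illustrative names.

```python
from typing import Dict, Hashable, List

Node = Hashable


def compute_lca_nodes(
    parent: Dict[Node, Node],
    depth: Dict[Node, int],
    leaves_in_dfs_order: List[Node],
    added_for: Dict[Node, Hashable],
) -> Dict[Hashable, Node]:
    """Return lca(r) for every inner reticulation r that has ambiguous leaves."""

    def lca(a: Node, b: Node) -> Node:
        # Naive walk towards the root; a placeholder for an O(1) lca query.
        while a != b:
            if depth[a] < depth[b]:
                b = parent[b]
            else:
                a = parent[a]
        return a

    result: Dict[Hashable, Node] = {}
    for leaf in leaves_in_dfs_order:
        r = added_for.get(leaf)
        if r is None:  # not an ambiguous leaf
            continue
        # First copy seen: lca(r) is the leaf itself; later copies are folded in.
        result[r] = leaf if r not in result else lca(result[r], leaf)
    return result
```

With a constant-time lca query in place of the inner function, the loop performs one query per ambiguous leaf, which gives the linear bound claimed in the proposition.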

Proposition 6.5 (i) Let $\ell$ be a leaf in $T_C$ that is neither ambiguous nor optional. If $\ell \notin B$, $B$ is not a soft cluster of any node $u$ in the path from $\rho(C)$ to $\ell$ in $N$.

(ii) For each $r \in IR(C)$ such that $c(r) \notin B$, $B$ is not a soft cluster of any $u$ in the path from $\rho(C)$ to lca($r$) inclusively in $N$.

Proof (i) Since $\ell$ is neither ambiguous nor optional, all the nodes in the path from $\rho(C)$ to $\ell$ appear in any subtree $T$ of $N$. Since $\ell \notin B$, the cluster of each node in the path is not equal to $B$.

(ii) For an inner reticulation node $r$ below $C$, $A_r$ contains at least two ambiguous leaves and thus lca($r$) is an internal node in $C$. Any subtree $T$ of $N$ contains exactly one incoming edge of $r$ below lca($r$). Thus, the cluster of each node $u$ in the path from $\rho(C)$ to lca($r$) in $T$ must contain $c(r)$ and hence is not equal to $B$. □

Let $T_{lca}$ be the spanning subtree of $T_C$ over $\{\ell \in \mathcal{L}(T_C) \mid \ell \notin \mathcal{A}(T_C) \cup \mathcal{O}(T_C)\} \cup V_{lca}$, where $\mathcal{A}(T_C)$ and $\mathcal{O}(T_C)$ are the sets of ambiguous and optional leaves in $T_C$, respectively. Clearly, $T_{lca}$ is rooted at $\rho(C)$. We further define $V_{max} = \{v \in V(T_C) \mid v \notin V(T_{lca}) \text{ and } p(v) \in V(T_{lca})\}$.
Proposition 6.6 $B$ is a soft cluster in $\mathcal{D}_N(\rho(C))$ if and only if a node $v \in V_{max}$ exists such that for each $\ell \in B$ there is a leaf below $v$ with the same label as $\ell$.


Proof Assume $B$ is a soft cluster of a node $u$ in $\mathcal{D}_N(\rho(C))$. By Proposition 6.5, $u$ is not in $T_{lca}$ and thus it is below some $v \in V_{max}$. For any $\ell \in B$, $u$ and hence $v$ has a leaf descendant having the same label as $\ell$.

Let $v \in V_{max}$ satisfy the property that for each $\ell \in B$, there is a leaf $\ell'$ below $v$ having the same label as $\ell$. For each $x \in IR(C)$ such that $c(x) \notin B$, by the definition of $V_{max}$, $A_x$ contains an ambiguous leaf not below $v$. For this $x$, we select a parent $p'_x$ of $x$ not below $v$ in $C$.

For each $y \in IR(C)$ such that $c(y) \in B$, we select a parent $p_y$ below $v$.

For each $r \in CR(C)$ such that $c(r) \in B$, we select a parent $p_r$ below $v$.

Set

$$E = \{(p, r) \in \mathcal{E}(N) \mid p \in V(C), r \in \mathcal{A}(T_C) \cup \mathcal{O}(T_C)\} - \{(p'_x, x) \mid x \in IR(C) \text{ such that } c(x) \notin B\} - \{(p_y, y) \mid y \in IR(C) \text{ such that } c(y) \in B\} - \{(p_r, r) \mid r \in CR(C) \text{ such that } c(r) \in B\}.$$

Then, $\mathcal{D}_N(\rho(C)) - E$ is a subtree in which $B$ is the cluster of $v$. It is not hard to see that $\mathcal{D}_N(\rho(C)) - E$ can be extended into a subtree of $N$. □


Taken together, the above facts imply Algorithm 2 for determining whether a leaf subset is a soft cluster in a lowest big tree-node component or not, presented in Table 4.

The correctness of Algorithm 2 follows from Propositions 6.5 and 6.6. Step 1 takes constant time. Step 2 can be done in $O(\sum_{u \in V(C)} |c(u)|)$ time. Step 3 takes $O(|\mathcal{E}(T_C)|)$ time (see [10]). By Proposition 6.4, Step 4 can be done in $O(|\mathcal{E}(T_C)|)$ time.


Algorithm 2

Input: $T_C$ and a subset $B$ of leaves in $\mathcal{D}_N(\rho(C))$;

1. If $|B| == 1$, output “Yes” and exit;

2. Construct $T_C$ defined in Eqn. (10) and (11);

3. Pre-process $T_C$ so that the lca of any two nodes can be found in $O(1)$ time;

4. Traverse the leaves in $T_C$ to compute the nodes in $V_{lca}$;

5. For each leaf $\ell \notin \mathcal{A}(C) \cup \mathcal{O}(C)$ such that $\ell \notin B$
     mark the nodes in the path from $\rho(T_C)$ to it;

   For each $r \in IR(C)$ such that $c(r) \notin B$
     mark the nodes in the path from $\rho(T_C)$ to lca($r$) inclusively;

6. Traverse the nodes $u$ in $T_C$ to compute the nodes in $V_{max}$:
     check if $u$ is unmarked and its parent is marked in Step 5 when visiting $u$;

7. For each node $u \in V_{max}$ {

     7.1 Check whether or not all leaves in $B$ are below $u$;

     7.2 Output “Yes” and exit if so;

   }

8. Output “No” and exit;


Table 4: An algorithm to decide whether $B$ is a soft cluster in $C$ in the non-binary case.
Two of the paths marked in Step 5 may have a common subpath starting at the root $\rho(T_C)$. We mark the nodes in each of these paths in a bottom-up manner: whenever we reach a marked node, we stop the marking process in the current path. In this way, each marked node is visited twice at most and hence Step 5 can be executed in $O(|\mathcal{E}(T_C)|)$ time.
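The bottom-up marking with early stopping, followed by the selection of the unmarked nodes with a marked parent (Step 6), can be sketched as below. The `parent` dictionary (the root has no entry) and the list of `targets` (the path endpoints chosen in Step 5) are assumed inputs with illustrative names.

```python
from typing import Dict, Hashable, Iterable, List, Set

Node = Hashable


def mark_paths(parent: Dict[Node, Node], targets: Iterable[Node]) -> Set[Node]:
    """Mark every node on a root-to-target path, walking upward from each target."""
    marked: Set[Node] = set()
    for node in targets:
        u = node
        # Stop as soon as an already marked node is reached: the rest of the
        # path to the root was marked by an earlier walk.
        while u is not None and u not in marked:
            marked.add(u)
            u = parent.get(u)
    return marked


def maximal_unmarked(
    parent: Dict[Node, Node], nodes: Iterable[Node], marked: Set[Node]
) -> List[Node]:
    """Return the unmarked nodes whose parent is marked (the candidates of Step 6)."""
    return [u for u in nodes if u not in marked and parent.get(u) in marked]
```

Because each upward walk halts at the first marked node, the total work is proportional to the number of marked nodes plus the number of targets.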

Obviously, Step 6 takes $O(|\mathcal{E}(T_C)|)$ time. For each node $u$, Step 7.1 takes $O(|\mathcal{E}(\mathcal{D}_{T_C}(u))|)$ time. Since all the examined subtrees are disjoint, the total time taken by Step 7.1 is $O(|\mathcal{E}(T_C)|)$.

Taken together, these facts imply that Algorithm 2 is a linear-time algorithm. Plugging Algorithm 2 into Step 2.2 in the CCP algorithm, we can solve the CCP in linear time.

7 Conclusion 

Our algorithms are designed using a powerful decomposition theorem. The theorem holds for arbitrary reticulation-visible networks. We are very interested in exploring its applications in estimating the size of a network having a visibility property and in designing algorithms for reconstructing reticulation-visible networks from gene trees or gene sequences. Another interesting problem is how to determine whether two networks display the same set of binary trees in polynomial time. A solution for this is definitely valuable in phylogenetics.


Acknowledgments 

The authors are grateful to Philippe Gambette, Anthony Labarre, and Stephane Vialette for 
discussion on the problems studied in this work. This work was supported by a Singapore
MOE ARF Tier-1 grant R-146-000-177-112 and the Merlion Programme 2013.


References 

[1] Bender, M.A., et al.: Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms 57, 75-94 (2005)

[2] Cardona, G., Rossello, F., Valiente, G.: Comparison of tree-child phylogenetic networks. IEEE-ACM Trans. Comput. Biol. Bioinform. 6, 552-569 (2009)

[3] Chan, J.M., Carlsson, G., Rabadan, R.: Topology of viral evolution. PNAS 110, 18566-18571 (2013)

[4] Dagan, T., Artzy-Randrup, Y., Martin, W.: Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. PNAS 105(29), 10039-10044 (2008)

[5] Doolittle, W.F.: Phylogenetic classification and the universal tree. Science 248, 2124-2128 (1999)

[6] Gambette, P., Gunawan, A.D.M., Labarre, A., Vialette, S., Zhang, L.X.: Locating a tree in a phylogenetic network in quadratic time. In Proc. of RECOMB'15, pp. 96-107, Springer (2015)

[7] Gambette, P., Gunawan, A.D.M., Labarre, A., Vialette, S., Zhang, L.X.: Solving the tree containment problem for genetically stable networks in quadratic time, submitted manuscript.

[8] Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman Publisher, San Francisco, USA (1979)

[9] Gusfield, D.: ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. The MIT Press (2014)

[10] Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 338-355 (1984)

[11] Huson, D.H., Klopper, T.H.: Beyond galled trees: decomposition and computation of galled networks. In: Proc. of RECOMB'2007, LNCS, vol. 4453, pp. 211-225, Springer (2007)

[12] Huson, D.H., Rupp, R., Scornavacca, C.: Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press (2011)

[13] Kanj, I.A., Nakhleh, L., Than, C., Xia, G.: Seeing the trees and their branches in the network is hard. Theor. Comput. Sci. 401, 153-164 (2008)

[14] Lengauer, T., Tarjan, R.E.: A fast algorithm for finding dominators in a flowgraph. ACM Trans. Prog. Lang. Sys. 1, 121-141 (1979)

[15] Linz, S., John, K.S., Semple, C.: Counting trees in a phylogenetic network is #P-complete. SIAM J. Comput. 42, 1768-1776 (2013)

[16] Marcussen, T., et al., Steuernagel, B., Mayer, K.F.X., Olsen, O.-A.: Ancient hybridizations among the ancestral genomes of bread wheat. Science, DOI: 10.1126/science.1250092 (2014)

[17] Moret, B.M.E., et al.: Phylogenetic networks: Modeling, reconstructibility, and accuracy. IEEE-ACM TCBB 1, 13-23 (2004)

[18] Nakhleh, L.: Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol. Evol. 28(12), 719-728 (2013)

[19] Treangen, T.J., Rocha, E.P.: Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genetics 7, e1001284 (2011)

[20] van Iersel, L., Semple, C., Steel, M.: Locating a tree in a phylogenetic network. Inform. Process. Letters 110, 1037-1043 (2010)

[21] Wang, L., Zhang, K., Zhang, L.X.: Perfect phylogenetic networks with recombination. J. Comp. Biol. 8, 69-78 (2001)

[22] Zhang, L., Cui, Y.: An efficient method for DNA-based species assignment via gene tree and species tree reconciliation. In Proc. of WABI'2010, pp. 300-311. Springer (2010)



